Transport in London¶

Introduction¶

Established in 2000, Transport for London (TfL) is the body responsible for the vital transport network of the United Kingdom's capital. Its role is to manage London's web of transportation systems throughout the city, integrating various modes of public transport, including buses, the London Underground (Tube), London Overground (LO), trams, and more, under a single governance structure. TfL's roots trace back to earlier bodies responsible for London's transportation, beginning with London Transport, formed in 1933. The creation of TfL marked a pivotal moment in the evolution of London's transportation system, establishing a more cohesive approach to urban mobility. Currently overseeing an intricate network spanning approximately 700 miles of roads and 1,000 kilometres of cycle lanes, TfL has also revolutionised fare payment, with over 1 billion contactless journeys annually, and contributed to the reduction of carbon emissions by introducing hybrid and electric buses. Furthermore, TfL holds global significance: its achievements in congestion management, environmental sustainability, and accessibility have inspired cities worldwide to recalibrate their own transit ecosystems.

Many determinants influence millions of TfL journeys that occur daily, including economic shifts, population dynamics, technological advancements, urban development, and policy interventions. By analysing historical and contemporary factors that influence the number of journeys across TfL’s various transport modes, we aim to gain key insights and a broader understanding of the future of London's transport network. Through this exploration, we unveil patterns and trends that contribute to informed decision-making in urban planning and policy making, which will ultimately pave the way for a more resilient and responsive transport future in London.


For our project, we have chosen to explore the number of journeys across different modes of public transport in London from April 2010 to December 2023. We have leveraged multiple datasets reliably sourced from the London Datastore, TfL and government websites for this project. We delve into the trends in the number of journeys over time across various transportation modes, such as buses, the London Underground, the Docklands Light Railway (DLR), London Tramlink, and the London Overground. We also explore other determinants, such as transport crime across these modes, traffic volumes across London's local authorities, and the performance levels of the transport system. With this we can uncover the intrinsic significance public transport holds for urban safety and public welfare in London. Focusing on temporal and modal variations in the millions of journeys that occur every day, we can unearth insights into the crucial role transport networks play in the daily lives of the city's inhabitants, and predict the future number of journeys made throughout the TfL system. This in turn can aid urban planning, law enforcement, and future policy-making concerning the overall well-being of London's commuters. In this notebook, we analyse past travel data, uncovering insightful trends through Exploratory Data Analysis to answer important research questions around the factors influencing these patterns. We also use regression analysis to predict the future number of journeys, ultimately contributing towards the goal of fostering a more efficient, connected and secure public transportation environment for the citizens of London.

Research questions and objectives¶

  1. Which events significantly impacted the number of journeys on the entire TfL network over the years?
  2. Are there any considerable events that made passengers switch between means of transportation?
  3. How has the overall trend of transport-related crimes in London changed over the years 2009-2023?
  4. What are the key factors influencing the number of TfL journeys?
  5. Can historical data on TfL journeys be used to predict future passenger numbers accurately, and can we create a model to predict the number of TfL journeys in 2024?
In [1]:
# Uncomment the following line if the "shutup" package is not already installed on your device
#!pip install shutup
In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import datetime as dt
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller
import shutup; shutup.please()

Data¶

Data sources¶

The datasets we have chosen to analyse in this project, provided in the form of Excel and CSV spreadsheets, include:

  1. Public Transport Journeys by Type of Transport from the London DataStore
  2. Underground services performance – Service Operated, from Transport for London
  3. Transport Crime in London from the London DataStore
  4. Traffic Flows in Kilometres from GOV.UK

These datasets are licensed by the UK government, ensuring reliable, accurate and regularly updated travel data. The websites offer a user-friendly interface for easy public access, which reinforces their ethical standard of transparency towards users.

The "Public Transport Journeys by Type of Transport" data provides the number of journeys, in millions, on the public transport network by type of transport, broken down into bus, Underground, DLR, tram, Overground and cable car. The data is presented as an Excel spreadsheet covering 28/04/2010 to 09/12/2023, in roughly monthly reporting periods. Period lengths vary in periods 1 and 13 of each financial year, and the reported figures are adjusted for these differences. Journey counts for the Docklands Light Railway are derived from automatic passenger counts at stations, while Overground and tram journeys are based on automatic on-carriage passenger counts. Additionally, reliable journey numbers for the Overground have only been available since October 2010.
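The period-length adjustment can be illustrated with a small, purely hypothetical example (the figures below are made up; the actual adjusted values come pre-computed in the dataset):

```python
import pandas as pd

# Hypothetical raw figures: journeys (millions) for two reporting periods of
# different lengths, normalised to a common 28-day period for comparison.
periods = pd.DataFrame({
    'journeys_m': [160.0, 150.0],  # made-up raw totals in millions
    'days':       [31, 28],        # length of each reporting period in days
})
periods['adjusted_m'] = periods['journeys_m'] / periods['days'] * 28
print(periods['adjusted_m'].round(2).tolist())  # [144.52, 150.0]
```

After normalisation, the longer period's apparent advantage disappears and the two periods become directly comparable.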

The "Underground services performance – Service Operated" data provides the performance of specific Underground lines against their key performance metrics. It is presented as a CSV file, with reporting periods of four weeks and 13 periods per financial year, from April 2018 to December 2023. This data compares the actual number of tube trips against the scheduled number over time, using a predetermined set of measuring points. It is based on current working timetables, adjusted for planned closures, engineering works, disruptions of three hours or more due to system issues, industrial action, force majeure, or unplanned events.

The "Transport Crime in London" data provides monthly breakdowns of crime volume and the rate of crime per million passenger journeys, in an Excel spreadsheet, from 01/04/2009 to 31/03/2023. The data covers rail-related crimes on the LO, LU, DLR and Tramlink networks, handled by the British Transport Police (BTP), and crimes on the bus network, overseen by the Metropolitan Police Service (MPS). London Overground data contains missing observations before April 2011 due to limitations in passenger journey information. TfL Rail and the DLR were introduced to the data in 2017 and 2018 respectively, and TfL Rail was later renamed the Elizabeth line in 2023.

The "Traffic Flows" data provides annual traffic volume, measured as the total distance travelled by all motor vehicles on London's roads, in million kilometres. This data is presented in an Excel spreadsheet, from 1993 to 2022, broken down by London borough. Some key events that affected these figures include the September 2000 fuel protest, the foot-and-mouth disease outbreak in 2001 and the coronavirus (COVID-19) pandemic.

In [3]:
# Importing journeys data
n_journeys = pd.read_excel('data/tfl-journeys-type.xlsx', 'Journeys', parse_dates=['Period ending']).iloc[:,4:]

# Importing traffic flows data
traffic = pd.read_excel('data/traffic-flow-borough.xlsx', 'Traffic Flows - Cars')

# Importing crime data
# One block per financial year in the 'Volume and Rates' sheet; each tuple gives
# that year's (nrows, skiprows) offsets within the spreadsheet
crime_blocks = [(4, 0), (4, 7), (4, 14), (4, 21), (4, 29), (4, 37), (4, 45),
                (5, 53), (6, 62), (6, 72), (6, 82), (6, 92), (6, 102), (6, 112)]
for year, (nrows, skiprows) in zip(range(10, 24), crime_blocks):
    globals()['crime_' + str(year)] = pd.read_excel('data/public-transport-crime-london.xlsx',
                                                    'Volume and Rates', header=[0, 1],
                                                    index_col=0, nrows=nrows, skiprows=skiprows)

# Importing performance data
service_operated = pd.read_csv('data/service-operated.csv').iloc[:,:9]

Data cleaning¶

Initially, we import all data into the Python notebook and carry out a cleaning process to remove missing values and duplicates. We then reshape the data into formatted tables that are easier to read, analyse and manipulate; for example, for the number-of-journeys data, the dates of reported journeys form the rows and the number of journeys on each transport mode forms the columns, as shown below.

Number of journeys¶

In [4]:
# Changing the name of the columns
n_journeys.columns=['date','bus_n','tube_n','dlr_n','tram_n','og_n','cable_n','eliz_n']

# Setting dates as index
n_journeys.set_index('date',inplace=True)

# Aggregating data by years
n_journeys_y = n_journeys.groupby(n_journeys.index.to_period('Y').to_timestamp()).sum(min_count = 1)

# Aggregating data by month
n_journeys_m = n_journeys.groupby(n_journeys.index.to_period('M').to_timestamp()).mean()

# Aggregating data for all the means of transportation
n_journeys_total = n_journeys.agg('sum', axis=1)

n_journeys.tail()
Out[4]:
bus_n tube_n dlr_n tram_n og_n cable_n eliz_n
date
2023-08-19 129.710544 86.647955 7.213507 1.425394 12.375553 0.177132 15.359475
2023-09-16 139.594975 83.658393 7.301096 1.688012 13.837056 0.142462 15.627150
2023-10-14 155.412029 93.392539 8.280110 1.631487 14.841794 0.093054 17.298925
2023-11-11 146.374533 95.801186 7.759405 1.268640 14.983251 0.100782 17.838075
2023-12-09 150.650065 101.769430 7.925633 1.734252 14.982537 0.065571 17.778625

In the cleaning process of the number of journeys data, the columns were renamed to more meaningful names specifying the types of transport (bus, tube, DLR, tram, Overground, cable car, Elizabeth line). The 'date' column was set as the index, facilitating time-based analysis, and the data was then aggregated by year (n_journeys_y) and by month (n_journeys_m). Additionally, a new series (n_journeys_total) was created, representing the total number of journeys across all means of transportation.
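One detail worth flagging in the yearly aggregation is the min_count=1 argument: without it, a year consisting entirely of missing values (e.g. the Overground before October 2010) would silently sum to 0 instead of remaining missing. A minimal illustration:

```python
import pandas as pd
import numpy as np

# Summing an all-NaN series: by default pandas treats the NaNs as 0,
# but min_count=1 requires at least one valid value, else returns NaN.
s = pd.Series([np.nan, np.nan])
print(s.sum())             # 0.0
print(s.sum(min_count=1))  # nan
```

Keeping such years as NaN prevents spurious zero-journey years from distorting later plots and models.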

In [5]:
n_journeys.describe()
Out[5]:
bus_n tube_n dlr_n tram_n og_n cable_n eliz_n
count 178.000000 178.000000 178.000000 178.000000 171.000000 149.000000 112.000000
mean 159.202482 88.861158 7.599498 1.996925 11.389076 0.110158 5.295398
std 35.125796 23.992997 1.885426 0.458678 3.495306 0.059092 4.227002
min 30.223736 5.745632 1.205125 0.440934 0.999693 0.000169 0.594615
25% 145.235034 84.620162 6.485058 1.688709 8.944797 0.075648 3.276068
50% 173.661797 94.207425 7.739988 2.157100 11.582781 0.110490 3.768055
75% 182.448439 105.077546 9.162448 2.315166 14.398171 0.132138 4.642658
max 207.509939 118.222383 10.636562 2.765871 17.820632 0.534218 17.838075
In [6]:
fig_box = go.Figure()
fig_box.add_trace(go.Box(y=n_journeys['tram_n'], name= 'tram_n', notched=True, boxpoints='all', fillcolor= 'lawngreen'))
fig_box.add_trace(go.Box(y=n_journeys['eliz_n'], name= 'eliz_n', notched=True, boxpoints='all', fillcolor= 'darkviolet'))
fig_box.add_trace(go.Box(y=n_journeys['dlr_n'],  name= 'dlr_n', notched=True, boxpoints='all', fillcolor= 'turquoise'))
fig_box.add_trace(go.Box(y=n_journeys['og_n'],   name= 'og_n', notched=True, boxpoints='all', fillcolor= 'orange'))
fig_box.add_trace(go.Box(y=n_journeys['tube_n'], name= 'tube_n', notched=True, boxpoints='all', fillcolor= 'blue'))
fig_box.add_trace(go.Box(y=n_journeys['bus_n'],  name= 'bus_n', notched=True, boxpoints='all', fillcolor= 'red'))
fig_box.show()

Exploring the number of journeys across different transport types through box-plots, we can clearly observe that buses see a significantly higher monthly passenger volume relative to their rail counterparts. The high median and large interquartile range suggest a positive skew in the number of bus passengers, stretching the distribution towards higher values. This may be due to higher accessibility, as buses tend to offer more door-to-door service than tubes or trams, especially over shorter distances. Other relevant factors include being more cost-effective, providing local connectivity between neighbouring areas, and covering a more extensive area, all of which attract a diverse range of passengers.

We created an interactive boxplot with the ability to turn the legends on and off for different transport modes. This interactive feature enables viewers to compare the categories with ease: for example, as seen in the plot, tube and bus volumes are substantially higher than those of the other modes, so one can switch off these two legends to compare the modes with significantly fewer journeys more visibly. This interactivity also allows users to explore the data dynamically, providing additional context on specific values, such as summary statistics, as a more hands-on approach to understanding the trends within the data.

Traffic flows¶

In [7]:
traffic_lnd = traffic[traffic['Local Authority']=='London'].transpose()[2:]
traffic_lnd.columns=['traffic']
traffic_lnd.index = pd.Series(pd.date_range("1993-01-01", periods=len(traffic_lnd), freq=pd.offsets.YearBegin()))
traffic_lnd.head()
Out[7]:
traffic
1993-01-01 25560.0
1994-01-01 25851.0
1995-01-01 25755.0
1996-01-01 26009.0
1997-01-01 26119.0

Crime data¶

Over the years, as TfL has shaped the way Londoners move, it has also faced the challenge of ensuring the safety and well-being of travellers. From the early days of London Transport, the city has recognised the importance of safeguarding passengers from security threats as the transport infrastructure expanded. Transport crimes range from petty offences to more serious incidents, across modes of transport connecting diverse neighbourhoods and urban landscapes, with policing strategies and security measures continuously evolving to balance safety with urban mobility.

In [8]:
lis = ['Vol','Rate']*12
# Setting the columns of the final output
crime = pd.DataFrame(columns = ['bus_vol', 'bus_rate','tube_vol', 'tube_rate', 'dlr_vol', 'dlr_rate',
                                'tram_vol', 'tram_rate', 'og_vol', 'og_rate', 'eliz_vol','eliz_rate'])

for i in range(10, 24):
    # Creating some temporary dataframes for each year
    df = globals()['crime_'+str(i)]
    
    if len(df) == 4:
        df.index = ['Bus', 'London Underground', 'London Overground', 'Trams']
    elif len(df) == 5:
        df.index = ['Bus', 'London Underground', 'London Overground', 'Tfl Rail', 'Trams']
    else:
        df.index = ['Bus', 'London Underground', 'Docklands Light Railway', 'London Overground', 'Tfl Rail', 'Trams']
        
    # Setting the columns of the output dividing per year  
    globals()['table_'+str(i)] = pd.DataFrame(columns = ['bus_vol', 'bus_rate','tube_vol', 'tube_rate', 'dlr_vol',
                                                         'dlr_rate', 'tram_vol', 'tram_rate', 'og_vol', 'og_rate',
                                                         'eliz_vol','eliz_rate'])
    invert = df.transpose()
    invert.index = lis
    for row in range(len(globals()['crime_'+str(i)])):
        
        # Using the temporary dataframes to fill the output dataframe per year
        if df.index[row] == 'Bus':
            column = invert.loc['Vol']['Bus']
            globals()['table_'+str(i)]['bus_vol'] = column.to_list()
            column = invert.loc['Rate']['Bus']
            globals()['table_'+str(i)]['bus_rate'] = column.to_list()
        if df.index[row]  == 'London Underground':
            column = invert.loc['Vol']['London Underground']
            globals()['table_'+str(i)]['tube_vol'] = column.to_list()
            column = invert.loc['Rate']['London Underground']
            globals()['table_'+str(i)]['tube_rate'] = column.to_list()
        if df.index[row]  == 'London Overground':
            column = invert.loc['Vol']['London Overground']
            globals()['table_'+str(i)]['og_vol'] = column.to_list()
            column = invert.loc['Rate']['London Overground']
            globals()['table_'+str(i)]['og_rate'] = column.to_list()
        if df.index[row]  == 'Trams':
            column = invert.loc['Vol']['Trams']
            globals()['table_'+str(i)]['tram_vol'] = column.to_list()
            column = invert.loc['Rate']['Trams']
            globals()['table_'+str(i)]['tram_rate'] = column.to_list()
        if df.index[row]  == 'Docklands Light Railway':
            column = invert.loc['Vol']['Docklands Light Railway']
            globals()['table_'+str(i)]['dlr_vol'] = column.to_list()
            column = invert.loc['Rate']['Docklands Light Railway']
            globals()['table_'+str(i)]['dlr_rate'] = column.to_list()
        if df.index[row]  == 'Tfl Rail':
            column = invert.loc['Vol']['Tfl Rail']
            globals()['table_'+str(i)]['eliz_vol'] = column.to_list()
            column = invert.loc['Rate']['Tfl Rail']
            globals()['table_'+str(i)]['eliz_rate'] = column.to_list()
            
    # Concatenating the output per year in only one dataset
    crime = pd.concat([crime,globals()['table_'+str(i)]])

# Setting the DatetimeIndex accordingly
crime.index = pd.Series(pd.date_range("2009-04-01", periods=168, freq="M"))
crime.replace('-', np.nan, inplace = True)
crime = crime.astype(float)
crime_m = crime.groupby(crime.index.to_period('M').to_timestamp()).mean()

crime_m.tail()
Out[8]:
bus_vol bus_rate tube_vol tube_rate dlr_vol dlr_rate tram_vol tram_rate og_vol og_rate eliz_vol eliz_rate
2022-11-01 1669.0 10.6 1725.0 18.1 98.0 12.1 37.0 20.6 160.0 10.7 139.0 8.5
2022-12-01 1396.0 10.1 1552.0 17.5 76.0 11.0 33.0 21.6 128.0 12.2 142.0 9.5
2023-01-01 1523.0 10.2 1687.0 19.1 67.0 8.7 41.0 22.6 123.0 9.2 108.0 6.9
2023-02-01 1436.0 10.2 1682.0 19.0 81.0 10.4 24.0 14.3 135.0 10.7 130.0 8.2
2023-03-01 1729.0 13.2 1917.0 24.9 85.0 12.0 28.0 18.5 171.0 14.5 128.0 9.6

In the cleaning process of the crime numbers, a list was defined to label the two metrics (volume and rate) recorded for each mode. The final output DataFrame (crime) was created with columns representing the different transportation modes and crime metrics. A loop was used to process each yearly dataset, with temporary DataFrames (table_10 to table_23) created for each year to organise and structure the crime data. The loop involved transposing the original data, adjusting the index, and populating the output DataFrame accordingly. The resulting crime DataFrame was then given a DatetimeIndex and aggregated by monthly mean to create crime_m. Additionally, missing values represented as '-' were replaced with NaN, and the entire DataFrame was converted to floats. The final crime_m DataFrame provides a consolidated, organised view of crime across different transportation modes and metrics over the period.

In [9]:
crime_m.describe()
Out[9]:
bus_vol bus_rate tube_vol tube_rate dlr_vol dlr_rate tram_vol tram_rate og_vol og_rate eliz_vol eliz_rate
count 168.000000 168.000000 168.000000 168.000000 72.000000 72.000000 168.000000 168.000000 168.000000 144.000000 84.000000 84.000000
mean 1496.053571 8.829991 1085.892857 12.249847 56.361111 8.128765 23.517857 10.989226 90.761905 8.907276 60.190476 14.408868
std 342.314616 1.862150 323.753640 6.522995 16.303724 3.778607 8.938227 4.018835 34.224206 4.894076 26.870271 5.937156
min 450.000000 6.100000 322.000000 6.300000 28.000000 3.200000 5.000000 2.900000 20.000000 4.300000 24.000000 6.600000
25% 1305.750000 7.344459 895.750000 8.575000 44.000000 5.575000 17.000000 8.200000 64.000000 6.600000 38.750000 9.975000
50% 1485.000000 8.310099 1046.000000 10.600000 55.500000 7.525941 22.000000 10.300000 88.000000 7.750000 53.000000 13.426945
75% 1661.500000 10.000000 1209.500000 13.525000 66.250000 9.575000 29.250000 13.500000 116.250000 9.525000 74.250000 17.132027
max 2402.000000 16.900000 2597.000000 63.400000 107.000000 26.600000 49.000000 22.600000 171.000000 50.100000 142.000000 38.300000
In [10]:
fig_viol_vol = go.Figure()
fig_viol_vol.add_trace(go.Violin(y=crime['bus_vol'], name='bus_vol', line_color='red'))
fig_viol_vol.add_trace(go.Violin(y=crime['tube_vol'], name='tube_vol', line_color='blue'))
fig_viol_vol.add_trace(go.Violin(y=crime['dlr_vol'], name='dlr_vol', line_color='turquoise'))
fig_viol_vol.add_trace(go.Violin(y=crime['tram_vol'], name='tram_vol', line_color='lawngreen'))
fig_viol_vol.add_trace(go.Violin(y=crime['og_vol'], name='og_vol', line_color='orange'))
fig_viol_vol.add_trace(go.Violin(y=crime['eliz_vol'], name='eliz_vol', line_color='darkviolet'))
fig_viol_vol.update_traces(box_visible=True, points='all', jitter = 0.05, meanline_visible=True)
In [11]:
fig_viol_rate = go.Figure()
fig_viol_rate.add_trace(go.Violin(y=crime['bus_rate'], name='bus_rate', line_color='red'))
fig_viol_rate.add_trace(go.Violin(y=crime['tube_rate'], name='tube_rate', line_color='blue'))
fig_viol_rate.add_trace(go.Violin(y=crime['dlr_rate'], name='dlr_rate', line_color='turquoise'))
fig_viol_rate.add_trace(go.Violin(y=crime['tram_rate'], name='tram_rate', line_color='lawngreen'))
fig_viol_rate.add_trace(go.Violin(y=crime['og_rate'], name='og_rate', line_color='orange'))
fig_viol_rate.add_trace(go.Violin(y=crime['eliz_rate'], name='eliz_rate', line_color='darkviolet'))
fig_viol_rate.update_traces(box_visible=True, points='all', jitter = 0.05, meanline_visible=True)

Assessing the crime rate across transportation modes, rather than the crime volume, gives a much fairer comparison that is easier to visualise. These violin plots are useful for showing the distribution and probability density of the crime data across different transport modes. Interpreting the distributions, we observe that the widths for all modes are very similar and relatively symmetrical, suggesting similar probability densities and small skewness. Another key observation is the elongated upper tail for the tube and Overground crime rates, suggesting that their outliers are relatively extreme.
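The elongated upper tails noted above can be quantified with sample skewness. As a hedged sketch, the synthetic series below stands in for a column such as crime_m['tube_rate'] (which is only available after running the cells above):

```python
import pandas as pd

# Synthetic monthly crime rates with one extreme upper outlier, mimicking the
# elongated right tail observed in the violin plots.
rates = pd.Series([8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 60.0])
print(rates.skew() > 2)  # strongly positive skewness -> long right tail
```

A skewness well above zero confirms that a handful of extreme months, rather than the bulk of the data, stretch the distribution upwards.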

Performance data¶

We imported data on performance, which we measure as the percentage of service operated (the number of train departures compared to the scheduled number). When cleaning the performance data, we converted the percentage strings to floats and mapped the 13 four-week financial periods per year onto 12 calendar months for ease of manipulation.

In [12]:
# Selecting only the summed up general network data
service_tube = service_operated[service_operated.Line == 'Network'].reset_index(drop=True)

# Converting the percentual to float
service_tube = pd.DataFrame(service_tube['Service Operated for Period - All Week'].str.rstrip('%').astype(float)/100)
service_tube.columns = ['performance']

# Fixing the index using financial months as in the dataset
service_tube.index = pd.Series(pd.date_range("2018-01-01", periods=len(service_tube), freq='28D'))

# Aggregating the data monthly taking the mean to make comparison with the other datasets
service_tube_m = service_tube.groupby(service_tube.index.to_period('M').to_timestamp()).mean()
service_tube.tail()
Out[12]:
performance
2023-05-15 0.910
2023-06-12 0.907
2023-07-10 0.920
2023-08-07 0.912
2023-09-04 0.900

Visualisations¶

1. Which events significantly impacted the number of journeys on the entire TfL network over the years?¶

Data visualisation makes it easier to grasp patterns and relationships within complex numerical datasets in a clear and concise way. Using Plotly, we created line plots comparing the number of passengers and traffic volume over time. Visualising the data temporally can help predict future trends in journeys and aid decision-making on travel policy in London.

In [13]:
# Create figure with secondary y-axis
fig = make_subplots(specs=[[{"secondary_y": True}]])

# Add traces
fig.add_trace(go.Scatter(x=n_journeys.index, y=n_journeys_total, name="tfl_n"), secondary_y=False)
fig.add_trace(go.Scatter(x=traffic_lnd.index, y=traffic_lnd['traffic'], name="traffic"), secondary_y=True)

# Add figure title
fig.update_layout(title_text="Number of TfL Customers vs Traffic",
                  xaxis_range=[dt.date(2010,1,1), dt.date(2022,10,1)],xaxis_rangeslider_visible=True)
# Set x-axis title
fig.update_xaxes(title_text="Date")

# Set y-axes titles
fig.update_yaxes(title_text="Millions of Passengers", secondary_y=False)
fig.update_yaxes(title_text="Millions of Kilometres Travelled by Cars", secondary_y=True)

fig.show()
In [45]:
fig1 = go.Figure()
fig1.add_trace(go.Scatter(x=n_journeys.index, y=n_journeys['tram_n'], fill='tozeroy', line_color= 'lawngreen',
                         name='tram_n'))
fig1.add_trace(go.Scatter(x=n_journeys.index, y=n_journeys['eliz_n'], fill='tonexty', line_color= 'darkviolet',
                         name='eliz_n'))
fig1.add_trace(go.Scatter(x=n_journeys.index, y=n_journeys['dlr_n'],  fill='tozeroy', line_color= 'turquoise',
                         name='dlr_n'))
fig1.add_trace(go.Scatter(x=n_journeys.index, y=n_journeys['og_n'],   fill='tonexty', line_color= 'orange',
                         name='og_n'))
fig1.add_trace(go.Scatter(x=n_journeys.index, y=n_journeys['tube_n'], fill='tonexty', line_color= 'blue',
                         name='tube_n')) 
fig1.add_trace(go.Scatter(x=n_journeys.index, y=n_journeys['bus_n'],  fill='tonexty', line_color= 'red',
                         name='bus_n'))
fig1.add_trace(go.Scatter(x=n_journeys.index, y=n_journeys_total,     fill='tonexty', line_color= 'navy',
                         name='tfl_n'))

# Events on London Underground
signif_dates_tube = [dt.date(2012,12,8),dt.date(2021,9,18)]
days_tube = [dt.date(2012,12,8),dt.date(2021,9,18)]

scatter_tube = n_journeys.tube_n[n_journeys.index.isin(signif_dates_tube)]

actual_text_tube= ['<br>Introduction of Contactless Payment</br>',
                   '<br>Northern Line Extension:</br>Kennington to Battersea']

hovertext_tube = ['<b>' + str(day) + '</b>' + actual_text_tube[i] for i, day in enumerate(days_tube)]

fig1.add_trace(go.Scatter(x=scatter_tube.index, y=scatter_tube, mode='markers',
                          name='events_tube', hovertext=hovertext_tube, hoverinfo="text",
                          marker=dict(color="red", size = [10]*2)))

# Events on London Overground
signif_dates_og = [dt.date(2011,3,5),dt.date(2012,12,8),dt.date(2015,5,30)]
days_og = [dt.date(2011,2,28),dt.date(2012,12,9),dt.date(2015,5,31)]

scatter_og = n_journeys.og_n[n_journeys.index.isin(signif_dates_og)]

actual_text_og= ['<br>OG Addition:</br>Dalston Junction to Highbury & Islington',
               '<br>OG Addition:</br>Surrey Quays to Clapham Junction',
               '<br>OG Addition:</br>Liverpool Street to Enfield Town, Cheshunt and Chingford<br>\
Romford to Upminster</br>']

hovertext_og = ['<b>' + str(day) + '</b>' + actual_text_og[i] for i, day in enumerate(days_og)]

fig1.add_trace(go.Scatter(x=scatter_og.index, y=scatter_og, mode='markers',
                          name='events_og', hovertext=hovertext_og, hoverinfo="text",
                          marker=dict(color="blue", size = [10]*3)))

# Events on DLR
signif_dates_dlr = [dt.date(2011,8,20),dt.date(2015,11,14)]
days_dlr = [dt.date(2011,8,31),dt.date(2015,11,11)]

scatter_dlr = n_journeys.dlr_n[n_journeys.index.isin(signif_dates_dlr)]

actual_text_dlr= ['<br>DLR Extension:</br>Canning Town to Stratford',
                  '<br>Change of fares:</br>From Zone 3 to Zone 2/3']

hovertext_dlr = ['<b>' + str(day) + '</b>' + actual_text_dlr[i] for i, day in enumerate(days_dlr)]

fig1.add_trace(go.Scatter(x=scatter_dlr.index, y=scatter_dlr, mode='markers',
                          name='events_dlr', hovertext=hovertext_dlr, hoverinfo="text",
                          marker=dict(color="green", size = [10]*2)))

# Events on Elizabeth Line
signif_dates_eliz = [dt.date(2018,5,26),dt.date(2019,12,7),dt.date(2022,5,28),dt.date(2023,5,27)]
days_eliz = [dt.date(2018,5,31),dt.date(2019,12,15),dt.date(2022,5,24),dt.date(2023,5,21)]

scatter_eliz = n_journeys.eliz_n[n_journeys.index.isin(signif_dates_eliz)]

actual_text_eliz = ['<br>TfL Rail Extension:</br>Paddington to Heathrow',
               '<br>TfL Rail Extension:</br>Paddington to Reading',
               '<br>TfL Rail Extension:</br>Paddington to Abbey Wood<br>Rebrand to Elizabeth Line</br>',
               '<br>Elizabeth Line Extension:</br>Paddington to  Shenfield<br>Reading to Abbey Wood<br>\
Heathrow to Abbey Wood</br>']

hovertext_eliz = ['<b>' + str(day) + '</b>' + actual_text_eliz[i] for i, day in enumerate(days_eliz)]

fig1.add_trace(go.Scatter(x=scatter_eliz.index, y=scatter_eliz, mode='markers',
                          name='events_eliz', hovertext=hovertext_eliz, hoverinfo="text",
                          marker=dict(color="red", size = [10]*4)))

fig1.update_layout(title_text="Number of Passengers on TfL Network", xaxis_title="Date",
                   yaxis_title="Millions of Passengers", xaxis_range=[dt.date(2010,5,1), dt.date(2023,12,9)],
                   xaxis_rangeslider_visible=True)

fig1.show()

Producing a temporal plot of the number of passengers and comparing modes of transit over the years 2010-2023, we observe that the network total, buses and the Tube see a much higher volume of travellers, while trams, the Overground, the DLR and the Elizabeth line (previously known as TfL Rail) remain consistently low throughout the period. However, for all transportation modes studied, a sudden decrease in journeys is observed in 2020. The most significant reason for this drop is the unprecedented change in daily life and travel patterns following the global COVID-19 pandemic: during lockdowns imposed to restrict the spread of the virus, public transport usage plummeted.

To identify the significant events in this period, the viewer can switch off the tfl_n, bus_n, tube_n and events_tube traces to see a clearer graph highlighting the changes that followed the main events affecting the Overground, DLR and Elizabeth line. Hovering over the event on 2015-05-31, which marks an addition to the Overground line, we see a sharp increase in Overground passengers from that point on. A similar effect followed the DLR extension on 2011-08-31. An opposing effect occurred when the DLR underwent a change in fares on 2015-11-11, causing a slight reduction in DLR passengers. We also observe a significant jump in passengers following the event on 2022-05-24, when TfL Rail was extended and rebranded as the Elizabeth line.

2. Are there any considerable events that made passengers switch across means of transportation?¶

In [16]:
before2015_bus = n_journeys_m.bus_n[n_journeys_m.index<='2015-06-01'].mean()
after2015_bus = n_journeys_m.bus_n[(n_journeys_m.index>='2015-06-01') & (n_journeys_m.index<'2020-01-01')].mean()
In [17]:
plt.figure(figsize=(10,6))
plt.plot(n_journeys_m.bus_n[n_journeys_m.index<'2020-01-01'].index, n_journeys_m.bus_n[n_journeys_m.index<'2020-01-01'])
x1, y1 = [dt.date(2010,1,1), dt.date(2015,6,1)], [before2015_bus, before2015_bus]
x2, y2 = [dt.date(2015,6,1), dt.date(2020,1,1)], [after2015_bus, after2015_bus]
plt.plot(x1, y1, x2, y2, marker = 'o')
plt.xlabel('Date')
plt.ylabel('Millions of journeys')
plt.title('Number of journeys on the London Buses 2010-2020')
plt.legend(labels = ['Number of journeys on buses', 'Mean before June 2015', 'Mean after June 2015'])
plt.show()

We explore the change in the number of bus journeys before and after June 2015, following the introduction of TfL Rail. This was a pivotal point for TfL, improving connectivity, speeding up journeys and offering commuters more direct routes to their destinations. As the graph shows, the mean number of bus passengers dropped, a decrease of 8 thousand journeys on average.
Another pivotal change in the TfL network in 2015 was the restructuring of the Overground line to connect more suburban rail routes, following the acquisition of rail lines from the Greater Anglia services. This shift is evident when inspecting the mean of Overground journeys before and after 2015. Likely reasons again include enhanced service frequency, more optimised routes and a generally improved passenger experience.
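A before/after comparison of means can be backed by a formal test of whether the shift is statistically significant. A minimal sketch using Welch's two-sample t-test on synthetic data (the arrays below are illustrative levels, not the actual TfL figures):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# hypothetical monthly journey counts (millions), with a level shift after the event
before = rng.normal(loc=190, scale=5, size=60)   # illustrative pre-event level
after = rng.normal(loc=182, scale=5, size=55)    # illustrative post-event level

# Welch's t-test: is the difference in means statistically significant?
t_stat, p_value = stats.ttest_ind(before, after, equal_var=False)
print(f"mean before: {before.mean():.1f}, mean after: {after.mean():.1f}, p = {p_value:.2g}")
```

Applied to `n_journeys_m.bus_n` split at June 2015, the same call would indicate whether the observed drop exceeds ordinary month-to-month variation.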

In [18]:
before2015_og = n_journeys_m.og_n[n_journeys_m.index<='2015-01-01'].mean()
after2015_og = n_journeys_m.og_n[(n_journeys_m.index>='2015-01-01') & (n_journeys_m.index<'2020-01-01')].mean()
In [19]:
plt.figure(figsize=(10,6))
plt.plot(n_journeys_m.og_n[n_journeys_m.index<'2020-01-01'].index, n_journeys_m.og_n[n_journeys_m.index<'2020-01-01'])
x1, y1 = [dt.date(2010,1,1), dt.date(2015,1,1)], [before2015_og, before2015_og]
x2, y2 = [dt.date(2015,1,1), dt.date(2020,1,1)], [after2015_og, after2015_og]
plt.plot(x1, y1, x2, y2, marker = 'o')
plt.xlabel('Date')
plt.ylabel('Millions of journeys')
plt.title('Number of journeys on the London Overground 2010-2020')
plt.legend(labels = ['Number of journeys on OG', 'Mean before beginning of 2015', 'Mean after beginning of 2015'])
plt.show()

3. How has overall trend of transport-related crimes in London changed over years 2009-2023?¶

In [15]:
fig2 = go.Figure()
fig2.add_trace(go.Scatter(x=crime['bus_vol'].index, y=crime['bus_vol'],
                          line_color= 'red', name='bus_crime', fill = 'tonexty'))
fig2.add_trace(go.Scatter(x=crime['tube_vol'].index, y=crime['tube_vol'], fill='tozeroy',
                          line_color= 'blue', name='tube_crime'))

fig2.update_layout(title_text="Volume of Crimes on Bus vs Tube", xaxis_title="Date",
                   yaxis_title="Volume of Crimes", xaxis_range=[dt.date(2009,4,30), dt.date(2023,3,31)],
                   xaxis_rangeslider_visible=True)

fig2.show()

Studying a temporal graph of crime volumes across transport modes over the studied period, we observe a persistent gap between Tube and bus crime: buses have a higher volume of reported incidents than the Tube. This may be due to numerous factors, one being the difference in passenger demographics: bus services tend to attract a more diverse group of riders than the Tube, potentially creating conditions that criminals find conducive to certain activities. Furthermore, the bus network covers a broader range of routes and neighbourhoods than the Tube network, exposing these journeys to environments with higher crime rates driven by socio-economic factors.

Modelling¶

4. What are the key factors influencing the number of TfL journeys?¶

Linear Regression Model¶

We conduct a regression analysis on some key determinants of transport to predict the total number of journeys. Regression helps us understand the relationship between two or more variables, for example whether the average crime rate is associated with variations in the total number of journeys. Predicting journey numbers offers insights for policy making on the safety and security of passengers and commuters: it serves as a barometer of public safety within the transport network, reflecting the effectiveness of security measures and policing strategies. Regression analysis also provides statistical inference about the relationships observed, allowing us to assess their strength and significance and supporting hypothesis testing to continually improve TfL services.

In [20]:
# average crime rate with a weighted mean using the volumes as weights
crime_m = crime_m.fillna(0)
avg_crime_rate = pd.DataFrame(index = crime_m.index, columns = ['value'])
for i in range(len(crime_m)):  # iterate over crime_m, whose index avg_crime_rate shares
    avg_crime_rate.iloc[i] = np.average(crime_m.iloc[i][[ 'bus_rate','tube_rate', 'dlr_rate',
                                'tram_rate', 'og_rate', 'eliz_rate']],
                            weights = crime_m.iloc[i][['bus_vol','tube_vol',  'dlr_vol',
                                'tram_vol',  'og_vol',  'eliz_vol']])
# aggregating monthly to compare
n_journeys_total_m = n_journeys_total.groupby(n_journeys_total.index.to_period('M').to_timestamp()).mean()

We compute an average crime rate for each month by taking the weighted average of the registered crime rates per mode of transport, using the volume of crime as weights.
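The row-by-row loop above can also be written as a single vectorised operation, which is faster and less error-prone. A sketch on an illustrative frame that mirrors the `*_rate` / `*_vol` column layout of `crime_m` (the data here is synthetic):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
modes = ['bus', 'tube', 'dlr', 'tram', 'og', 'eliz']
# illustrative frame mirroring crime_m's rate/volume column layout
crime_demo = pd.DataFrame(
    {f'{m}_rate': rng.uniform(1, 10, 5) for m in modes} |
    {f'{m}_vol': rng.integers(10, 500, 5).astype(float) for m in modes})

rates = crime_demo[[f'{m}_rate' for m in modes]].to_numpy()
vols = crime_demo[[f'{m}_vol' for m in modes]].to_numpy()
# row-wise weighted mean: sum(rate * vol) / sum(vol), computed for all rows at once
avg_rate = (rates * vols).sum(axis=1) / vols.sum(axis=1)
```

Each entry of `avg_rate` matches what `np.average(..., weights=...)` produces for the corresponding row.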

In [21]:
# Preparing the data for the regression
data_regr = pd.concat([avg_crime_rate, n_journeys_total_m], axis = 1).dropna()
data_regr.columns = ['avg_crime_rate','n_journeys_total_m']
X = data_regr.iloc[:,0].astype(float)
y = data_regr.iloc[:,1]
In [22]:
X = sm.add_constant(X)
mod = sm.OLS(y,X)
fit = mod.fit()
fit.summary()
Out[22]:
OLS Regression Results
Dep. Variable: n_journeys_total_m R-squared: 0.572
Model: OLS Adj. R-squared: 0.569
Method: Least Squares F-statistic: 200.2
Date: Thu, 25 Jan 2024 Prob (F-statistic): 2.09e-29
Time: 23:24:33 Log-Likelihood: -775.35
No. Observations: 152 AIC: 1555.
Df Residuals: 150 BIC: 1561.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 423.4630 11.101 38.145 0.000 401.527 445.398
avg_crime_rate -14.8890 1.052 -14.148 0.000 -16.968 -12.810
Omnibus: 12.502 Durbin-Watson: 0.815
Prob(Omnibus): 0.002 Jarque-Bera (JB): 28.373
Skew: 0.251 Prob(JB): 6.90e-07
Kurtosis: 5.056 Cond. No. 36.4


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [23]:
data_regr['year'] = data_regr.index.year
px.scatter(data_regr,x = 'avg_crime_rate', y = 'n_journeys_total_m', 
           color = 'year',trendline='ols', title = 'Average Crime Rate to Number of Journeys (in million)')

We first regress the total number of journeys on the average crime rate. The analysis gives a negative coefficient of -14.89, suggesting a steep negative relationship: as the average crime rate increases by one point, the number of journeys made decreases by roughly 14.89 million. The coefficients are statistically significant, as the associated p-values are close to zero. This is also visualised in the scatter plot, where the data points follow a strongly negative line of best fit. The adjusted R² is 0.569, which accounts for the number of predictors in the model and indicates that approximately 56.9% of the variability in the total number of journeys is explained by the model. Note that the standard errors assume the covariance matrix of the errors is correctly specified, and the covariance type is non-robust.

Performance Tube¶

[n_journeys_tube ~ performance_tube]

In [24]:
data_regr = pd.concat([service_tube_m, n_journeys_m.tube_n], axis = 1).dropna()
X = data_regr['performance']
y = data_regr['tube_n']
In [25]:
X = sm.add_constant(X)
mod=sm.OLS(y,X)
fit = mod.fit()
fit.summary()
Out[25]:
OLS Regression Results
Dep. Variable: tube_n R-squared: 0.008
Model: OLS Adj. R-squared: -0.008
Method: Least Squares F-statistic: 0.5009
Date: Thu, 25 Jan 2024 Prob (F-statistic): 0.482
Time: 23:24:34 Log-Likelihood: -323.24
No. Observations: 67 AIC: 650.5
Df Residuals: 65 BIC: 654.9
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 42.2292 50.610 0.834 0.407 -58.845 143.303
performance 39.2473 55.453 0.708 0.482 -71.499 149.994
Omnibus: 6.575 Durbin-Watson: 0.209
Prob(Omnibus): 0.037 Jarque-Bera (JB): 6.688
Skew: -0.740 Prob(JB): 0.0353
Kurtosis: 2.545 Cond. No. 27.2


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [26]:
data_regr['year'] = data_regr.index.year
px.scatter(data_regr,x = 'performance', y = 'tube_n', color = 'year', trendline='ols', title = 'Performance on Tube to Number of journeys (in million)')

When regressing the number of Tube journeys on the performance of Tube services, we now obtain a positive coefficient of 39.25, which makes theoretical sense: higher performance should result in more journeys taken. However, the p-value of this coefficient (0.482) is high, so the coefficient is not statistically significant. Likewise, the slightly negative adjusted R-squared indicates that the predictor explains essentially none of the variance: the model fits no better than a constant mean would. As these factors leave substantial uncertainty about the true parameter values, further investigation of the model, for instance adding other predictors, may be necessary to improve its explanatory power.

5. Can historical data on TfL journeys be used to predict future passenger numbers accurately, and can we create a model to predict the number of TfL journeys in 2024?¶

Prediction on future values with SARIMA model¶

The SARIMA model describes a time series as the realisation of a stochastic process estimated from the historical observations, which allows us to forecast future values. In particular, SARIMA explicitly accounts for seasonality within a time series.

Number of journeys by tube¶

In [27]:
series = n_journeys_m['tube_n'].copy()
px.line(x=series.index, y=series, title="Number of Journeys (in million) on Tube over time")
In [28]:
adfuller(series)[1]
Out[28]:
0.05610300261297535

The p-value of the augmented Dickey-Fuller (ADF) test is higher than 0.05, so we cannot reject the presence of a unit root: the time series is not stationary and has to be differenced before being fitted with the SARIMA model.

In [29]:
plot_acf(series.diff().dropna())
plot_pacf(series.diff().dropna());

After differencing the time series, the autocorrelation and partial autocorrelation are significant only at lags 6 and 12, suggesting a 6-month seasonal pattern.

We fit the SARIMA model using this information and test it on the last 12 observations.

In [30]:
n = len(series)
# dividing the time series into training and test sets
test = series.iloc[n-12:n]
series_train = series.iloc[0:n-12]
In [31]:
model = SARIMAX(series_train, order=(0,1,0), seasonal_order = (3,1,0,6))
model_fit = model.fit(disp=False)  # disp=False suppresses the optimiser's verbose log
print(model_fit.summary())
residuals = pd.DataFrame(model_fit.resid)
residuals.plot()
plt.show()
# density plot of residuals
residuals.plot(kind='kde')
plt.show()
# summary stats of residuals
print(residuals.describe())
                                     SARIMAX Results                                     
=========================================================================================
Dep. Variable:                            tube_n   No. Observations:                  149
Model:             SARIMAX(0, 1, 0)x(3, 1, 0, 6)   Log Likelihood                -545.076
Date:                           Thu, 25 Jan 2024   AIC                           1098.153
Time:                                   23:24:34   BIC                           1109.976
Sample:                                        0   HQIC                          1102.957
                                           - 149                                         
Covariance Type:                             opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.S.L6       -0.9453      0.092    -10.241      0.000      -1.126      -0.764
ar.S.L12      -0.4982      0.100     -4.986      0.000      -0.694      -0.302
ar.S.L18      -0.3452      0.076     -4.571      0.000      -0.493      -0.197
sigma2       119.9550      7.063     16.983      0.000     106.111     133.799
===================================================================================
Ljung-Box (L1) (Q):                   0.02   Jarque-Bera (JB):               302.20
Prob(Q):                              0.88   Prob(JB):                         0.00
Heteroskedasticity (H):               7.84   Skew:                            -0.87
Prob(H) (two-sided):                  0.00   Kurtosis:                         9.93
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
                0
count  149.000000
mean     0.530390
std     13.505401
min    -51.661748
25%     -5.014194
50%      0.864596
75%      5.094188
max     87.531447

The Ljung-Box test gives a p-value of 0.88, so we cannot reject the absence of significant autocorrelation in the residuals at lag 1.

The residuals do not behave perfectly like a normal distribution because of some extreme observations, in particular during the COVID crisis. Nevertheless, predictions of future observations can be made with this model, provided they are treated with caution as an approximation.

In [32]:
# Rolling forecast of the following months

prevision = []
for i in range(12):
    model = SARIMAX(series_train, order=(0,1,0), seasonal_order = (3,1,0,6)) 
    # refit on the training series, which is extended each iteration with the previous forecast
    model_fit = model.fit(disp=False)
    output = model_fit.forecast() # prediction of the following month
    series_train = pd.concat([series_train,output])
    prevision.append(output.iloc[0])
test = pd.DataFrame(test)
test['prevision'] =  prevision
In [33]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=test.index, y=test['tube_n'],
                    mode='lines',
                    name='Real values'))
fig.add_trace(go.Scatter(x=test.index, y=test['prevision'],
                    mode='lines',
                    name='SARIMA Predictions'))
fig.update_layout(title_text="Test of the SARIMA Model predictions on the number of journeys on Tube",
                  xaxis_title='Date',
                   yaxis_title="Number of journeys on Tube (in millions)")

The predictions made by the SARIMA model tend to be close to the registered values. From May 2023, the model tends to overestimate the number of Tube journeys; for capacity planning, an overestimation is arguably more desirable than an underestimation.
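The over- and underestimation discussed here can be quantified rather than eyeballed by computing forecast-error summaries over the test window. A sketch of RMSE and mean error (bias) on hypothetical actual/predicted arrays (the numbers are illustrative, not the model's output):

```python
import numpy as np

# illustrative actual vs predicted journeys (millions) for a 12-month test window
actual = np.array([85, 88, 90, 87, 92, 95, 89, 84, 93, 96, 94, 91], dtype=float)
predicted = np.array([86, 89, 93, 90, 94, 97, 92, 88, 95, 99, 97, 94], dtype=float)

errors = predicted - actual
rmse = np.sqrt(np.mean(errors ** 2))   # overall forecast accuracy
bias = errors.mean()                   # positive bias -> systematic overestimation
print(f"RMSE: {rmse:.2f}, bias: {bias:+.2f}")   # → RMSE: 2.65, bias: +2.50
```

Applied to `test['tube_n']` and `test['prevision']`, a positive bias would confirm the overestimation visible in the plot.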

We now use the same model to predict values for the next 13 months, assuming that the network remains as it is today.

In [34]:
prevision = pd.Series()
for i in range(13):
    model = SARIMAX(series, order=(0,1,0), seasonal_order = (3,1,0,6)) # same model already tested
    model_fit = model.fit(disp=False)
    output = model_fit.forecast() # prediction of the following month
    series = pd.concat([series,output])
    prevision = pd.concat([prevision,output])
    
prevision.index = pd.Series(pd.date_range(series.index[-14], periods=13, freq="M"))
prevision_plot = pd.concat([pd.Series(series.iloc[n-1], index = [series.index[n-1]]), prevision])
In [35]:
fig = go.Figure()
fig.update_layout(title='Number of journeys (in millions) on Tube: from 2020, with predictions for 2024',
                  xaxis_title='Date',
                   yaxis_title='N. of journeys')
fig.add_trace(go.Scatter(x=series.iloc[120:n].index, y=series.iloc[120:n],
                    mode='lines',
                    name='Last Registered Values'))
fig.add_trace(go.Scatter(x=prevision_plot.index, y=prevision_plot,
                    mode='lines',
                    name='SARIMA Predictions'))

Number of journeys by bus¶

In [36]:
series = n_journeys_m['bus_n'].copy()
px.line(x=series.index, y=series, title="Number of Journeys on Bus over time")
In [37]:
adfuller(series)[1]
Out[37]:
0.21675283505139398

The p-value of the ADF test is again higher than 0.05, so the time series is not stationary and has to be differenced before being fitted with the SARIMA model.

In [38]:
plot_acf(series.diff().dropna())
plot_pacf(series.diff().dropna());

After differencing the time series, the autocorrelation and partial autocorrelation are significant at lags 1, 4, 5, 6, 11 and 12, suggesting a 6-month seasonal pattern with a dependency on the previous observations, in particular the previous month.

We fit the SARIMA model using this information and test it on the last 12 observations.

In [39]:
n = len(series)
test = series.iloc[n-12:n]
series_train = series.iloc[0:n-12]
In [40]:
model = SARIMAX(series_train, order=(1,2,1), seasonal_order = (2,1,0,6))
model_fit = model.fit(disp=False)  # disp=False suppresses the optimiser's verbose log
print(model_fit.summary())
residuals = pd.DataFrame(model_fit.resid)
residuals.plot()
plt.show()
# density plot of residuals
residuals.plot(kind='kde')
plt.show()
# summary stats of residuals
print(residuals.describe())
                                     SARIMAX Results                                      
==========================================================================================
Dep. Variable:                              bus_n   No. Observations:                  149
Model:             SARIMAX(1, 2, 1)x(2, 1, [], 6)   Log Likelihood                -616.938
Date:                            Thu, 25 Jan 2024   AIC                           1243.876
Time:                                    23:24:37   BIC                           1258.620
Sample:                                         0   HQIC                          1249.868
                                            - 149                                         
Covariance Type:                              opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1         -0.1275      0.060     -2.112      0.035      -0.246      -0.009
ma.L1         -1.0000     11.399     -0.088      0.930     -23.341      21.342
ar.S.L6       -0.9669      0.075    -12.853      0.000      -1.114      -0.819
ar.S.L12      -0.2084      0.062     -3.351      0.001      -0.330      -0.087
sigma2       336.4330   3825.746      0.088      0.930   -7161.891    7834.757
===================================================================================
Ljung-Box (L1) (Q):                   0.01   Jarque-Bera (JB):               294.32
Prob(Q):                              0.94   Prob(JB):                         0.00
Heteroskedasticity (H):               5.38   Skew:                             0.13
Prob(H) (two-sided):                  0.00   Kurtosis:                        10.07
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
                0
count  149.000000
mean     0.722974
std     28.405356
min   -133.017853
25%     -7.941313
50%      0.569646
75%      7.867079
max    185.359726

As in the Tube's case, the residuals do not behave perfectly like a normal distribution because of some extreme observations, in particular during the COVID crisis.

In [41]:
prevision = []
for i in range(12):
    # note: these orders differ slightly from the diagnostic fit above ((1,2,1)x(2,1,0,6))
    model = SARIMAX(series_train, order=(1,2,0), seasonal_order = (2,1,1,6))
    model_fit = model.fit(disp=False)
    output = model_fit.forecast()
    series_train = pd.concat([series_train,output])
    prevision.append(output.iloc[0])
test = pd.DataFrame(test)
test['prevision'] =  prevision
In [42]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=test.index, y=test['bus_n'],
                    mode='lines',
                    name='Real values'))
fig.add_trace(go.Scatter(x=test.index, y=test['prevision'],
                    mode='lines',
                    name='SARIMA Predictions'))
fig.update_layout(title_text="Test of the SARIMA Model predictions on number of journeys by bus", xaxis_title="Months",
                   yaxis_title="Number of journeys by bus (in millions)")

The predictions made by the SARIMA model tend to follow the trend registered over the preceding months. In this case too the model tends to overestimate, which is still more desirable than an underestimation.

We now use the same model to predict values for the next 13 months.

In [43]:
prevision = pd.Series()

for i in range(13):
    model = SARIMAX(series, order=(1,2,0), seasonal_order = (2,1,1,6))
    model_fit = model.fit(disp=False)
    output = model_fit.forecast()
    series = pd.concat([series,output])
    prevision = pd.concat([prevision,output])
prevision.index = pd.Series(pd.date_range(series.index[-14], periods=13, freq="M"))
prevision_plot = pd.concat([pd.Series(series.iloc[n-1], index = [series.index[n-1]]), prevision])
In [44]:
fig = go.Figure()
fig.update_layout(title='Number of journeys by bus: from 2020, with predictions for 2024',
                   xaxis_title='Months',
                   yaxis_title='N. of journeys')
fig.add_trace(go.Scatter(x=series.iloc[120:n].index, y=series.iloc[120:n],
                    mode='lines',
                    name='Last Registered Values'))
fig.add_trace(go.Scatter(x=prevision_plot.index, y=prevision_plot,
                    mode='lines',
                    name='SARIMA Predictions'))

Conclusions¶

In conclusion, our analysis of transport in London, combining a data-driven exploration of TfL's history, visualisations integrating different transport modes, and regression and SARIMA modelling, has provided valuable insights into the complex dynamics of London's transport network. The project focused on understanding the factors influencing the number of journeys across TfL's diverse transport modes. By analysing historical and contemporary datasets and exploring patterns in public transport usage, crime rates and performance metrics, we were able to answer important research questions on the impact of significant events, the determinants of journey numbers, changes in transport-related crime, and the relationship between performance and journeys.

Our exploration revealed the profound impact of the COVID-19 pandemic on transport patterns, with a significant drop in journeys in 2020 due to lockdowns and reduced public transport usage. Crime rates also spiked during this period, likely influenced by shifts in law enforcement priorities and economic uncertainty. Regression analysis further illuminated the relationships of average crime rates, traffic flows and Tube performance with the total number of journeys. The negative relationship between average crime rates and journeys suggests that higher crime rates are associated with fewer journeys, emphasising the importance of security in promoting public transport use.

Employing the SARIMA model, we were able to predict future values of a time series, focusing on the number of Tube journeys. The ADF test indicated non-stationarity in the time series, requiring differencing before the SARIMA fit. Significant autocorrelation and partial autocorrelation at lags 6 and 12 suggested a 6-month seasonal pattern. The Ljung-Box test excluded significant autocorrelation between lags in the residuals, which nonetheless deviate from a normal distribution because of extreme observations during the COVID crisis. The SARIMA predictions are therefore made cautiously; they track the registered values closely but tend to overestimate, which is preferable to underestimating. Applying the model to the 13 months after December 2023 highlights an ongoing seasonal trend, continuing the slightly increasing pattern observed over the last months of 2023.

As this comprehensive analysis contributes to informed decision-making in urban planning, we conclude that to accommodate the predicted seasonal increase in the number of TfL journeys, continued investment in law enforcement and policy formulation is required for the future of London's transport network. By understanding the intricate connections between the various factors influencing journeys, TfL can continue its mission of providing a responsive and adaptive transportation network for the well-being of London's commuters.